A polynomial-time approximation to optimal multivariate microaggregation
Authors
Abstract
Microaggregation is a family of methods for statistical disclosure control (SDC) of microdata (records on individuals and/or companies), that is, for masking microdata so that they can be released without disclosing private information on the underlying individuals. Microaggregation techniques are currently being used by many statistical agencies. The principle of microaggregation is to group original database records into small aggregates prior to publication. Each aggregate should contain at least k records to prevent disclosure of individual information, where k is a constant value preset by the data protector. In addition to being a good masking method, microaggregation has recently been shown to be useful for achieving k-anonymity. In k-anonymity, the parameter k specifies the maximum acceptable disclosure risk, so that, once a value for k has been selected, the only job left is to maximize data utility: if microaggregation is used to implement k-anonymity, maximizing utility can be achieved by microaggregating optimally, i.e., with minimum within-groups variability loss. Unfortunately, optimal microaggregation can only be computed in polynomial time for univariate data. For multivariate data, it has been shown to be NP-hard. We present in this paper a polynomial-time approximation to microaggregate multivariate numerical data for which bounds to optimal microaggregation can be derived at least for two different optimality criteria: minimum within-groups Euclidean distance and minimum within-groups sum of squares. Beyond the theoretical interest of being the first microaggregation proposal with proven approximation bounds for any k, our method is empirically shown to be comparable to the best available heuristics for multivariate microaggregation. © 2007 Elsevier Ltd. All rights reserved.
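The grouping principle and the within-groups sum of squares criterion mentioned in the abstract can be illustrated with a minimal sketch. This is not the approximation algorithm of the paper: the ordering rule (sort by the first attribute) is a naive placeholder chosen only for illustration, and the function names `naive_groups`, `centroid`, and `within_groups_sse` are hypothetical.

```python
from typing import List, Sequence

Record = Sequence[float]


def centroid(group: List[Record]) -> List[float]:
    """Component-wise mean of the records in one group."""
    n = len(group)
    dim = len(group[0])
    return [sum(r[j] for r in group) / n for j in range(dim)]


def naive_groups(records: List[Record], k: int) -> List[List[Record]]:
    """Partition records into groups of at least k records.

    Assumption for illustration only: records are ordered by their first
    attribute and cut into consecutive blocks of k. Leftover records
    (fewer than k) are merged into the last block, so every group has
    between k and 2k - 1 records.
    """
    if k < 1 or len(records) < k:
        raise ValueError("need k >= 1 and at least k records")
    ordered = sorted(records, key=lambda r: r[0])
    groups = [list(ordered[i:i + k]) for i in range(0, len(ordered), k)]
    if len(groups) > 1 and len(groups[-1]) < k:
        tail = groups.pop()
        groups[-1].extend(tail)
    return groups


def within_groups_sse(groups: List[List[Record]]) -> float:
    """Sum of squared Euclidean distances from each record to its group centroid."""
    total = 0.0
    for group in groups:
        c = centroid(group)
        total += sum((r[j] - c[j]) ** 2 for r in group for j in range(len(c)))
    return total


def microaggregate(records: List[Record], k: int) -> List[List[float]]:
    """Masked data: one centroid per record, listed group by group."""
    masked = []
    for group in naive_groups(records, k):
        c = centroid(group)
        masked.extend([list(c) for _ in group])
    return masked
```

Publishing the output of `microaggregate` instead of the original records means each released value is shared by at least k records; a lower `within_groups_sse` indicates less information loss for the same k.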
Related resources
Optimal Multivariate 2-Microaggregation for Microdata Protection: A 2-Approximation
Microaggregation is a special clustering problem where the goal is to cluster a set of points into groups of at least k points in such a way that groups are as homogeneous as possible. Microaggregation arises in connection with anonymization of statistical databases for privacy protection (k-anonymity), where points are assimilated to database records. A usual group homogeneity criterion is wit...
Repeated Record Ordering for Constrained Size Clustering
One of the main techniques used in data mining is data clustering, which has many applications in computer science, biology, and social sciences. Constrained clustering is a type of clustering in which side information provided by the user is incorporated into current clustering algorithms. One of the well researched constrained clustering algorithms is called microaggregation. In a microaggreg...
Improved Univariate Microaggregation for Integer Values
Privacy issues during data publishing are an increasing concern of involved entities. The problem is addressed in the field of statistical disclosure control with the aim of producing protected datasets that are also useful for interested end users such as government agencies and research communities. The problem of producing useful protected datasets is addressed in multiple computational priva...
A Polynomial Algorithm for Optimal Microaggregation
Microaggregation is a technique that is used by statistical agencies to limit disclosure of sensitive microdata. Noting that no polynomial time algorithms are known to microaggregate optimally, Domingo-Ferrer and Mateo-Sanz have presented heuristic methods based on hierarchical clustering and genetic algorithms to identify sub-optimal solutions. We present an efficient polynomial time algorit...
Practical Data-Oriented Microaggregation for Statistical Disclosure Control
Microaggregation is a statistical disclosure control technique for microdata disseminated in statistical databases. Raw microdata (i.e., individual records or data vectors) are grouped into small aggregates prior to publication. Each aggregate should contain at least k data vectors to prevent disclosure of individual information, where k is a constant value preset by the data protector. No exa...
Journal title:
- Computers & Mathematics with Applications
Volume 55, Issue
Pages -
Publication date 2008